DATA SCIENCE SESSIONS VOL. 3

A Foundational Python Data Science Course

Session 16: Generalized Linear Models I. Binomial Logistic Regression and its MLE. Multinomial Regression

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


Generalized Linear Models

We will now begin to introduce a set of extremely useful statistical learning models. Their name - Generalized Linear Models (GLMs) - suggests that they are somehow related to Simple and Multiple Linear Regression models and yet somehow go beyond them. That is correct: GLMs generalize the linear model, in which predictors and their respective coefficients produce a linear combination of vectors, by introducing link functions to solve those kinds of problems that cannot be handled by Linear Regression. For example, what if the problem is not to predict a continuous value of the criterion, but the outcome variable is rather a dichotomy, so that the problem becomes one of categorization? E.g. predict the sex of a respondent given a set ${X}$ of their features? Enter Binomial Logistic Regression, the simplest GLM.

Another thing: GLMs cannot be estimated by minimizing the quadratic error as we estimated Simple and Multiple Linear Regression in the previous Session 15. The method used to estimate Linear Models is known as Least Squares Estimation. To fit GLMs to our data, we will introduce the concept of Likelihood in Probability Theory and learn about the Maximum Likelihood Estimate!

What happens when the assumptions of the Linear Model fail?

Let us briefly recall the assumptions of the (Multiple) Linear Regression model:

- the relationship between the predictors and the outcome is linear in the coefficients;
- the residuals are independent and Normally distributed with a mean of zero;
- the variance of the residuals is constant across the levels of the predictors (homoscedasticity);
- the predictors are not perfectly (or near-perfectly) correlated among themselves (no multicollinearity);
- the outcome variable is continuous.

What if we observe a set of variables that describe a statistical experiment that can result in either of two discrete outcomes? For example, we observe a description of a person's behavior, quantified in some way and organized into a set of variables, that should be used to predict the sex of that person? Or any other similar problem where the outcome can take only two values, say $0$ or $1$ (and immediately recall the Binomial Distribution)?

The assumptions of the Linear Model obviously constrain its application in such cases. We now ask the following question: would it be possible to generalize, expand, or modify the Linear Model somehow so that it can encompass the categorization problem? Because it sounds so appealing to have a set of predictors, combine them in a linear fashion, and estimate the coefficients so as to be able to predict whether the outcome will turn out one way or another.

There is a way to develop such a generalization of the Linear Model. In its simplest form it is the Binomial Logistic Regression. Binomial Logistic Regression is very similar to multiple regression, except that the outcome variable is now categorical (i.e. it is measured on a nominal scale that is a dichotomy).

1. Binomial Logistic Regression

Let's recall the form of the Linear Model with any number of predictors:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k + \epsilon$$

So we have a linear combination of $k$ predictors $\boldsymbol{X}$ plus the model error term $\epsilon$ on the RHS, and the outcome variable $Y$ on the LHS.

Now we assume that $Y$ can take only two possible values, call them $0$ and $1$ for ease of discussion. We want to predict whether $Y$ will happen to be ($1$) or not ($0$) given our observations of a set of predictors $\boldsymbol{X}$. However, in Binary Logistic Regression we do not predict the value of the outcome itself, but rather the probability that the outcome will turn out $1$ or $0$ given the predictors.

In the simplest possible case, where there is only one predictor $X_1$, this is exactly what we predict in Binary Logistic Regression:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1)}}$$

where $\beta_0$ and $\beta_1$ are the good old linear model coefficients. As we will see, the linear coefficients receive a new interpretation in Binary Logistic Regression - one rather different from the one they receive in the scope of the Linear Model.

With $k$ predictors we have:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$

Now, the above equation looks like it fell from the clear blue sky to solve the problem. There is a clear motivation for its form, of course: imagine that instead of predicting the state of $Y$ directly we decide to predict the odds of $Y$ turning out $1$ instead of $0$:

$$odds = \frac{p_1}{1-p_1}$$

Here goes the trick: if instead of predicting the odds $p_1/(1-p_1)$ we decide to predict the log-odds (also called the logit) from a linear combination of predictors

$$log \left( \frac{p_1}{1-p_1} \right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k$$

it turns out that we can recover the odds by taking the exponent of both LHS and RHS:

$$\frac{p_1}{1-p_1} = e^{(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}$$

and then by following a simple algebraic rearrangement:

(1) let's write $l=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k$ for simplicity and use $l$ to replace the whole linear combination in what follows;

(2) we then have $log \left( \frac{p_1}{1-p_1} \right)=l$;

(3) taking the exponent of both sides we arrive at $\frac{p_1}{1-p_1}=e^l$;

(4) it immediately follows that

$\frac{p_1}{1-p_1}=\frac{1}{e^{-l}} \implies p_1=\frac{1-p_1}{e^{-l}} \implies p_1+\frac{p_1}{e^{-l}}=\frac{1}{e^{-l}} \implies p_1e^{-l}+p_1=1 \implies p_1(1+e^{-l})=1 \implies p_1=\frac{1}{1+e^{-l}}$

and after rewriting $l$ as a linear combination again we find that the probability $p_1$ of the outcome $Y$ turning out $1$ is:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$

Now, imagine we set the following criterion: anytime we estimate $p_1$ to be larger than or equal to $.5$ we predict that $Y=1$, and anytime $p_1 < .5$ we predict that $Y=0$. What we need to do in order to learn how to predict $Y$ in this way is to estimate the coefficients $\beta_0$, $\beta_1$, $\beta_2$, etc., like we did in the case of a linear model. However, minimizing SSE will not work in this case: our predictions will be on a probability scale, while our observations are discrete, $0$ or $1$. We will have to find another way.

Data Preparation

Target: Predict churn from all numeric predictors
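The churn dataset itself is loaded in the accompanying code. A minimal sketch of the preparation step, assuming a CSV file named churn.csv with a binary churn column coded 0/1 (both the file name and the column names are assumptions, not the session's actual data):

```python
import pandas as pd

# Hypothetical file and column names, for illustration only
churn = pd.read_csv("churn.csv")

# Keep the numeric predictors and the binary target
num_cols = churn.select_dtypes(include="number").columns.drop("churn", errors="ignore")
X = churn[num_cols]
y = churn["churn"]

print(X.shape)
print(y.value_counts())
```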

We use Binomial Logistic Regression to predict the probability $p$ of a given observation $\textbf{x}$, with features $(x_1, x_2, \ldots, x_k)$, belonging to one of the two binary categories {0, 1}. We compute these probabilities via

$$p = \frac{1}{1+e^{-(\beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k + \beta_0)}},$$

where $\beta_1, \beta_2, \ldots, \beta_k$ are the model's parameters for the predictors, and $\beta_0$ is the intercept of the model.

In order to determine whether the predicted label $\hat{y}$ for a given observation $\textbf{x}$ has binary label 1 or 0, we impose a decision criterion $\sigma$ - a number in the (0, 1) interval. If $p > \sigma$, then we assign label $\hat{y} = 1$ to the observation $\textbf{x}$; else, we take $\hat{y} = 0$. Usually, we take $\sigma = 0.5$.
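A one-liner captures this decision rule; here is a minimal numpy sketch (the probabilities are made up):

```python
import numpy as np

def assign_label(p, sigma=0.5):
    """Assign binary labels from predicted probabilities using criterion sigma."""
    return (np.asarray(p) > sigma).astype(int)

print(assign_label([0.12, 0.50, 0.93]))  # -> [0 0 1] with sigma = 0.5
```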

The model is optimized by MLE (Maximum Likelihood Estimation), and the interpretation of the model coefficients is discussed in what follows.

N.B. There is a known problem with Wald's $z$ statistic; look:

The reason why the Wald statistic should be used cautiously is because, when the regression coefficient is large, the standard error tends to become inflated, resulting in the Wald statistic being underestimated (see Menard, 1995). The inflation of the standard error increases the probability of rejecting a predictor as being significant when in reality it is making a significant contribution to the model (i.e. you are more likely to make a Type II error). From: Andy Field, DISCOVERING STATISTICS USING SPSS, Third Edition, Sage.

Coefficients in BLR

The $\Delta Odds$ (Odds Ratio)

Do not forget that you have transformed your linear combination of model coefficients and predictors into a log-odds space: the logistic regression coefficient $\beta$ associated with a predictor $X$ is the expected change in the log-odds for a unit change in $X$.

So, by taking $e^{\beta_i}$, your coefficient now speaks the language of odds: it is the factor by which the odds of the outcome are multiplied for a unit increase in the predictor $X_i$, everything else held constant.

Which means that an odds ratio $e^{\beta_i}$ larger than $1$ signals an increase in the odds of the outcome, while an odds ratio smaller than $1$ signals a decrease (and $e^{\beta_i}=1$ means no effect).
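For instance, with statsmodels one can exponentiate the fitted coefficients directly; a small sketch on synthetic data (the true coefficient values below are made up):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with one predictor and known coefficients
rng = np.random.default_rng(42)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))  # true beta_0 = -0.5, beta_1 = 1.2
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(np.exp(fit.params))  # exponentiated coefficients = odds ratios
```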

The Akaike Information Criterion (AIC)

The Akaike Information Criterion (AIC) is a statistical measure used to evaluate the goodness-of-fit of a model. It is based on the principle of parsimony, which states that simpler models should be preferred over more complex ones, all else being equal.

The AIC is defined as follows:

$$AIC = -2\ln(\mathcal{L}) + 2k $$

where $\mathcal{L}$ is the model likelihood and $k$ is the number of parameters in the model.

The AIC penalizes models with more parameters by adding a penalty term $2k$ to the deviance $-2\ln(\mathcal{L})$. This penalty term is larger for models with more parameters, and hence it discourages overfitting and encourages simpler models.

The AIC can be used to compare different models and select the best one based on their AIC values. The model with the lowest AIC value is preferred, as it strikes a good balance between goodness-of-fit and simplicity.
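With statsmodels the AIC of a fitted Logit model is available directly, and the definition is easy to verify by hand (a sketch, reusing the synthetic-data pattern from above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.2 * x))))

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
k = len(fit.params)             # number of estimated parameters
print(-2 * fit.llf + 2 * k)     # AIC by hand...
print(fit.aic)                  # ...agrees with the reported AIC
```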

Model effect: comparison to the Null Model

Target: Predict churn from all the predictors
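The effect of the full model is assessed against the intercept-only (null) model via a likelihood ratio test: the statistic $-2(LL_{null} - LL_{model})$ follows a $\chi^2$ distribution with as many degrees of freedom as there are predictors. A sketch with statsmodels, which exposes the relevant quantities on the fitted results (synthetic data again, for illustration):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.2 * x))))

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

# Likelihood ratio test against the null (intercept-only) model
lr = -2 * (fit.llnull - fit.llf)
p_value = stats.chi2.sf(lr, df=fit.df_model)
print(lr, p_value)  # statsmodels also reports these as fit.llr and fit.llr_pvalue
```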

2. BLR using scikit-learn

Target: Predicting churn from numerical predictors

Target: Predicting churn from all the predictors
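A minimal sketch of the scikit-learn workflow for both targets, on a synthetic stand-in for the churn data (the true coefficients are made up; note that scikit-learn applies L2 regularization by default, so its estimates will not exactly match an unpenalized MLE fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the churn data
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
logits = X @ np.array([1.0, -0.5, 0.25]) - 0.3
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)  # L2-regularized by default
print(clf.intercept_, clf.coef_)
print(accuracy_score(y_test, clf.predict(X_test)))
```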

3. MLE for Binomial Logistic Regression

Say we have observed the following data: $HHTHTTHHHT$. Assume that we know the parameter $p_H$. We can then compute the Likelihood function $\mathcal{L}(p_H|HHTHTTHHHT)$ exactly as we did before. Now, this is the general form of the Binomial Likelihood (where $Y$ stands for the observed data):

$$\mathcal{L}(p_1|Y) = p_1^y(1-p_1)^{n-y}$$

where $y$ is the number of successes and $n$ the total number of observations. For each observed data point then we have

$$\mathcal{L}(p_1|y_i) = p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

where ${y_i}$ is the observed value of the outcome $Y$, and $\bar{y_i}$ is its complement (i.e. $1-y_i$: $1$ for $0$ and $0$ for $1$). This form just determines which value will be used in the computation of the Likelihood function at each observed data point: it will be either $p_1$ or $1-p_1$. The likelihood function for a given value of $p_1$ for the whole dataset is computed by multiplying the values of $\mathcal{L}(p_1|y_i)$ across the whole dataset, assuming the observations are independent (remember that multiplication in Probability is what conjunction is in Logic and Algebra).
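A small numpy sketch makes this concrete for the coin-flip example above, coding $H$ as $1$ and $T$ as $0$:

```python
import numpy as np

data = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])  # HHTHTTHHHT

def likelihood(p, y):
    """Likelihood of the observed sequence y under heads-probability p."""
    return np.prod(p ** y * (1 - p) ** (1 - y))

for p in (0.3, 0.5, 0.6, 0.7):
    print(p, likelihood(p, data))  # largest near the sample proportion 6/10
```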

Q: But... how do we get to $p_1$, the parameter value that we will use at each data point? A: We will search the parameter space $\beta_0, \beta_1, ..., \beta_k$ of linear coefficients in our Binary Logistic Model, of course, computing $p_1$ every time, and compute the likelihood function from it! In other words: we will search the parameter space to find the combination of $\beta_0, \beta_1, ..., \beta_k$ that produces the maximum of the likelihood function, just as we searched the space of linear coefficients to find the combination that minimizes the squared error in Simple Linear Regression.

So what combination of the linear coefficients is the best one?

It is the one which gives the Maximum Likelihood. This approach, known as Maximum Likelihood Estimation (MLE), stands behind many important statistical learning models. It is the cornerstone of Statistical Estimation Theory. It is contrasted with the Least Squares Estimation that we used earlier to estimate Simple and Multiple Linear Regression models.

Now, there is a technical problem related to this approach. To obtain the likelihood for the whole dataset one needs to multiply as many very small numbers as there are data points. That can cause computational problems related to the smallest real numbers that can be represented by digital computers. The workaround is to use the logarithm of likelihood instead, known as Log-Likelihood ($LL$).

Thus, while the Likelihood function for the whole dataset would be

$$\mathcal{L}(p_1|Y) = \prod_{i=1}^{n}p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

the Log-Likelihood function would be:

$$LL(p_1|Y) = \sum_{i=1}^{n} y_i log(p_1) + \bar{y_i} log(1-p_1)$$

And finally here is how we solve the Binomial Logistic Regression problem:

Technically, in optimization we do not go for the maximum of the Likelihood function directly: we work with $LL$ instead of $\mathcal{L}(p|Y)$, and since numerical optimizers typically minimize rather than maximize, the solution is to minimize the negative $LL$, written simply as $NLL$ - the Negative Log-Likelihood function.

Implement the model predictive pass given the parameters; the following blr_predict() function is nothing else than the implementation of the following expression:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$
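A minimal sketch of such a function (the exact signature used in the session may differ; here the convention is that beta holds the intercept first):

```python
import numpy as np

def blr_predict(beta, X):
    """Predicted probabilities p_1 for an (n, k) design matrix X.

    beta: array of length k + 1, with the intercept beta_0 first.
    """
    linear = beta[0] + X @ beta[1:]
    return 1.0 / (1.0 + np.exp(-linear))
```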

Test blr_predict()
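A quick sanity check with made-up parameters and data:

```python
import numpy as np

X = np.array([[0.0, 1.0],
              [2.0, -1.0],
              [1.0, 0.5]])
beta = np.array([-0.5, 1.0, 0.75])  # made-up beta_0, beta_1, beta_2

print(blr_predict(beta, X))         # three probabilities in (0, 1)
```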

Now define the Negative Log-Likelihood function:
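A sketch, building on blr_predict() above (the clipping of probabilities is a small numerical safeguard, not part of the formula):

```python
import numpy as np

def blr_nll(beta, X, y):
    """Negative Log-Likelihood of the data (X, y) under parameters beta."""
    p1 = blr_predict(beta, X)
    p1 = np.clip(p1, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
```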

Test blr_nll():
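Again a quick check, with the made-up data from above:

```python
import numpy as np

X = np.array([[0.0, 1.0],
              [2.0, -1.0],
              [1.0, 0.5]])
y = np.array([1, 0, 1])

print(blr_nll(np.array([-0.5, 1.0, 0.75]), X, y))
```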

Optimize!
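A sketch of the optimization with scipy.optimize.minimize, on synthetic data generated with known coefficients so that the recovery can be checked:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
true_beta = np.array([-0.5, 1.0, 0.75])
y = rng.binomial(1, blr_predict(true_beta, X))

res = minimize(blr_nll, x0=np.zeros(3), args=(X, y), method="BFGS")
print(res.x)  # should land close to true_beta
```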

Check against statsmodels
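Reusing the synthetic data from the previous cell: statsmodels' Logit maximizes the same likelihood, so the estimates should agree closely.

```python
import statsmodels.api as sm

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(fit.params)  # should closely match res.x from our own optimization
print(res.x)
```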

Plot the Negative Log-Likelihood Function
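One way to visualize it: a profile of the NLL over a single coefficient, holding the others at their optima (a sketch with matplotlib, reusing res, X and y from above):

```python
import numpy as np
import matplotlib.pyplot as plt

b1_grid = np.linspace(res.x[1] - 2, res.x[1] + 2, 100)
nll_vals = [blr_nll(np.array([res.x[0], b1, res.x[2]]), X, y) for b1 in b1_grid]

plt.plot(b1_grid, nll_vals)
plt.axvline(res.x[1], linestyle="--")  # the MLE of beta_1
plt.xlabel(r"$\beta_1$")
plt.ylabel("NLL")
plt.title("Negative Log-Likelihood profile")
plt.show()
```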

4. Multinomial Regression

The Multinomial Regression model is a powerful classification tool. Consider a problem where some outcome variable can result in more than two discrete outcomes. For example, a customer visiting a webshop can end their visit (a) buying nothing, (b) buying some Product A, or (c) Product B, or (d) Product C, etc. If we have some information about a particular customer's journey through the website (e.g. how much time they spent on some particular pages, whether they visited the webshop before, or any other information that customers might have chosen to disclose on their sign-up...), we can use it as a set of predictors of customer behavior resulting in any of (a), (b), (c), (d). We do that by means of a simple extension of the Binomial Logistic Regression that is used to solve for dichotomies: enter the Multinomial Regression model.

The Model

First, similar to what happens in dummy coding, given a set of $K$ possible outcomes we choose one of them as a baseline. Thus all results of the Multinomial Regression will be interpreted as effects relative to that baseline outcome category, for example: for a unit increase in predictor $X_1$, what is the change in odds of switching from (a) buying nothing to (b) buying Product A. We are already familiar with this logic, right?

So, consider a set of $K-1$ independent Binary Logistic Models with only one predictor $X$ where the baseline is now referred to as $K$:

$$log\frac{P(Y_i = 1)}{P(Y_i = K)} = \beta_{1,0} + \beta_{1,1}X$$

$$log\frac{P(Y_i = 2)}{P(Y_i = K)} = \beta_{2,0} + \beta_{2,1}X$$

$$\vdots$$

$$log\frac{P(Y_i = K-1)}{P(Y_i = K)} = \beta_{K-1,0} + \beta_{K-1,1}X$$

Obviously, we are introducing a set of new regression coefficients ($\beta_{k,\cdot}$) for each possible value of the outcome $k = 1, 2, \ldots, K-1$. The log-odds are on the LHS while the linear model remains on the RHS.

Now we exponentiate the equations to arrive at the expressions for odds:

$$\frac{P(Y_i = 1)}{P(Y_i = K)} = e^{\beta_{1,0} + \beta_{1,1}X}$$

$$\frac{P(Y_i = 2)}{P(Y_i = K)} = e^{\beta_{2,0} + \beta_{2,1}X}$$

$$\vdots$$

$$\frac{P(Y_i = K-1)}{P(Y_i = K)} = e^{\beta_{K-1,0} + \beta_{K-1,1}X}$$

And solve for $P(Y_i = 1), P(Y_i = 2),.. P(Y_i = K-1)$:

$$P(Y_i = 1) = P(Y_i = K)e^{\beta_{1,0} + \beta_{1,1}X}$$

$$P(Y_i = 2) = P(Y_i = K)e^{\beta_{2,0} + \beta_{2,1}X}$$

$$\vdots$$

$$P(Y_i = K-1) = P(Y_i = K)e^{\beta_{K-1,0} + \beta_{K-1,1}X}$$

From the fact that all probabilities $P(Y_i = 1), P(Y_i = 2), .., P(Y_i = K)$ must sum to one it can be shown that

$$P(Y_i = K) = \frac{1}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$

Because:

(1) $P(Y_i = 1) + P(Y_i = 2) + ... + P(Y_i = K) = 1$, we have

(2) $P(Y_i = K) + P(Y_i = K)e^{\beta_{1,0} + \beta_{1,1}X} + P(Y_i = K)e^{\beta_{2,0} + \beta_{2,1}X} + ... + P(Y_i = K)e^{\beta_{K-1,0} + \beta_{K-1,1}X} = 1$

(3) and then, replacing $\beta_{1,0} + \beta_{1,1}X$ by $l_1$, $\beta_{2,0} + \beta_{2,1}X$ by $l_2$, and $\beta_{k,0} + \beta_{k,1}X$ by $l_k$ in the general case, we have

(4) $P(Y_i = K) + P(Y_i = K)e^{l_1} + P(Y_i = K)e^{l_2} + ... + P(Y_i = K)e^{l_{K-1}} = 1$

(5) $P(Y_i = K)[1 + e^{l_1} + e^{l_2} + ... + e^{l_{K-1}}] = 1$ so that

(6) $P(Y_i = K) = \frac{1}{1 + e^{l_1} + e^{l_2} + ... + e^{l_{K-1}}} = \frac{1}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$

It is easy now to derive the expressions for all $K-1$ probabilities of the outcome resulting in a particular class:

$$P(Y_i = 1) = \frac{e^{\beta_{1,0} + \beta_{1,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$

$$P(Y_i = 2) = \frac{e^{\beta_{2,0} + \beta_{2,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$

$$\vdots$$

$$P(Y_i = K-1) = \frac{e^{\beta_{K-1,0} + \beta_{K-1,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$

Target: predict species from all continuous predictors

We use the Multinomial Logistic Regression model to predict the most probable category for a given observation $\textbf{x}$ with features $(x_1, x_2, \ldots, x_k)$. Assume that our target variable $y$ belongs to one of the categories from the set $\{1, 2, \ldots, C\}$. In MNR we usually select one category as the reference category; let that be category $C$. Then, the probability that the target variable $y$ belongs to category $c = 1,\ldots,C-1$ is calculated via

$$P(y = c) = \frac{e^{\beta^{(c)}_1x_1 + \beta^{(c)}_2x_2 + \cdots + \beta^{(c)}_kx_k + \beta^{(c)}_0}}{1+\sum_{j=1}^{C-1}e^{\beta^{(j)}_1x_1 + \beta^{(j)}_2x_2 + \cdots + \beta^{(j)}_kx_k + \beta^{(j)}_0}},$$

and the probability that it belongs to the reference category $C$ is

$$P(y = C) = \frac{1}{1+\sum_{j=1}^{C-1}e^{\beta^{(j)}_1x_1 + \beta^{(j)}_2x_2 + \cdots + \beta^{(j)}_kx_k + \beta^{(j)}_0}},$$

where $\beta^{(j)}_1, \beta^{(j)}_2, \ldots, \beta^{(j)}_k,\ j=1,\ldots,C-1$ are the model's parameters for the predictors and target variable categories, and $\beta^{(j)}_0$ is the intercept for category $j$.

After calculating all the probabilities $P(y = c),\ c=1,\ldots,C$ we predict the target variable as

$$\hat{y} = \textrm{argmax}_{c=1,\ldots,C}P(y=c).$$
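A compact numpy sketch of these formulas, for made-up coefficients with $C = 3$ categories and $k = 2$ predictors (the shapes and names are illustrative assumptions):

```python
import numpy as np

def mnr_probabilities(B, x):
    """Class probabilities for one observation x.

    B: array of shape (C - 1, k + 1); row j holds the intercept and the
       coefficients for category j + 1 (the reference category C has none).
    """
    scores = B[:, 0] + B[:, 1:] @ x                         # categories 1 .. C-1
    denom = 1.0 + np.sum(np.exp(scores))
    return np.append(np.exp(scores) / denom, 1.0 / denom)   # last entry: category C

B = np.array([[0.2, 1.0, -0.5],
              [-0.1, 0.3, 0.8]])   # made-up coefficients
x = np.array([0.5, -1.0])
p = mnr_probabilities(B, x)
print(p, p.sum())                  # the probabilities sum to 1
print(np.argmax(p) + 1)            # y-hat: the most probable category
```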

The model is estimated by MLE (Maximum Likelihood Estimation). For each category $c$ - except for the reference category $C$, of course - we obtain a set of coefficients. Each model coefficient, in each category, tells us about the $\Delta_{odds}$ in favor of the target category, for a unit change of a predictor, in comparison with the baseline category, and given that everything else is kept constant.

Multicollinearity in Multinomial Regression

Multinomial Logistic Regression using scikit-learn

N.B. scikit-learn does not implement the reference category automatically!
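Instead, scikit-learn fits one coefficient vector per class, in a symmetric (softmax) parameterization. A minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients per class - no reference category is singled out,
# so the coefficients are not directly comparable to the MNR formulation above
print(clf.coef_.shape)              # (3, 4) for the three iris species
print(clf.predict_proba(X[:3]).round(3))
```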

Further Reading


DataKolektiv, 2022/23.

hello@datakolektiv.com

License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.